“You are a scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team’s problems! She wants you to figure out a way to find players that are high performing but maybe not highly paid that you can steal to get the team to the playoffs!”
This site showcases my data-driven approach to recommending players for acquisition by an NBA team. I use a two-stage approach that combines unsupervised k-means clustering with a supervised regression model to make educated predictions about which high-performing players are underpaid and thus ideal targets for acquisition.
First, I will conduct unsupervised k-means clustering. This takes all relevant features into account and produces a new feature, cluster, which will later help the supervised regression model make more accurate predictions. To decide the ideal number of clusters for this dataset, I will use a function that evaluates explained variance over a range of cluster counts, revealing which number of clusters maximizes explained variance while minimizing complexity.
With clustering complete, I will produce a 3D visualization showing the players who perform highly in the stats most closely correlated with salary, in order to reveal the high-performing players who are underpaid relative to their peers. To identify the features most correlated with salary, I will build a correlogram of all the relevant features in the dataset.
I will then develop a supervised machine learning regression model to predict what a player should be earning given their performance stats. I will evaluate different regression models, such as rpart2 decision tree regression and a generalized linear model, to see which produces the most accurate predictions. Equipped with salary predictions, I will investigate the players who showed high performance metrics in the 3D visualization and see whether the models predict that these players are underpaid.
I will be using a dataset of 401 NBA players throughout the 2020-2021 season that includes the following information and stats:
For this dataset, I removed players who had incomplete stat reports. Removing players with NA values in any of their fields took 36 players out of consideration. Since 401 of the original 437 players (about 92%) remained available to inform the models and visualizations and to be considered as candidates for acquisition, removing the incomplete records did not meaningfully weaken the dataset.
As I selected the variables for the models, I removed those that would not provide value or could not be processed, such as the name of the player (Player) and the name of the team (Tm). Columns holding raw shooting data (made shots and attempted shots) were removed because the shooting percentage stats already capture that information. For the initial clustering model, position also had to be removed, as the k-means clustering approach that I employed cannot process categorical data.
With the data cleaned and prepared, the first thing I did was fit a k-means clustering model. Based on the features (variables) under consideration, k-means clustering assigns each player to a cluster so that similar players are grouped together. This provides value for the supervised machine learning step, because the cluster each player is assigned to can be used as a new feature that may be associated with salary and help the model make more accurate predictions.
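The two cleaning steps described above can be sketched as follows. This is a minimal illustration in plain Python rather than the R workflow used for the analysis; the column names Player, Tm, Pos, Age, and PTS come from the dataset, but the rows and the helper names `drop_incomplete` and `select_features` are invented for the example.

```python
# Hypothetical rows: the values are made up purely for illustration.
rows = [
    {"Player": "Player A", "Tm": "AAA", "Pos": "PG", "Age": 24, "PTS": 18.2},
    {"Player": "Player B", "Tm": "BBB", "Pos": "C", "Age": 29, "PTS": None},
    {"Player": "Player C", "Tm": "CCC", "Pos": "SF", "Age": 22, "PTS": 9.7},
]

# Identifier columns plus the categorical position column, which is
# excluded for the clustering step only.
EXCLUDED_FOR_CLUSTERING = {"Player", "Tm", "Pos"}


def drop_incomplete(records):
    """Keep only records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]


def select_features(records, excluded):
    """Return copies of the records without the excluded columns."""
    return [{k: v for k, v in r.items() if k not in excluded} for r in records]


complete = drop_incomplete(rows)
features = select_features(complete, EXCLUDED_FOR_CLUSTERING)

print(len(rows) - len(complete), "row(s) removed for missing values")
print(features)
```

Dropping whole rows is the simplest way to handle missing values; it is reasonable here only because so few players are affected.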
Using this elbow plot, I visualized the explained variance that results from running k-means clustering on this data with different k values (numbers of clusters). The elbow of the plot, the point past which adding clusters yields only small gains in explained variance, occurs at k = 3, so the ideal number of clusters for this dataset is 3.
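To make the elbow idea concrete, here is a small self-contained sketch in plain Python (not the R code behind the plot). It assumes that "explained variance" means the between-cluster sum of squares divided by the total sum of squares, uses a deterministic farthest-first initialisation instead of random starts, and runs on synthetic 2D points rather than the player data.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return [sum(column) / len(points) for column in zip(*points)]


def farthest_first_centers(points, k):
    # Deterministic initialisation (an assumption made for reproducibility):
    # start from the first point, then repeatedly add the point that is
    # farthest from every center chosen so far.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))
    return centers


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and center updates."""
    centers = farthest_first_centers(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def explained_variance(points, k):
    """1 - within-cluster SS / total SS, i.e. between SS / total SS."""
    centers, clusters = kmeans(points, k)
    grand_mean = mean_point(points)
    total_ss = sum(squared_distance(p, grand_mean) for p in points)
    within_ss = sum(
        squared_distance(p, centers[i])
        for i, cluster in enumerate(clusters)
        for p in cluster
    )
    return 1 - within_ss / total_ss


# Synthetic data: three tight, well-separated groups of five points each.
blob_centers = [(0, 0), (10, 0), (5, 9)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0, 0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

variance_by_k = {k: explained_variance(points, k) for k in range(1, 7)}
for k, value in variance_by_k.items():
    print(f"k={k}: explained variance = {value:.4f}")
```

On data like this the explained variance jumps sharply up to k = 3 and then flattens, which is the pattern an elbow plot is meant to expose.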
At this point, I also created a correlogram showing which of a player’s stats are most correlated with their salary.
This correlogram suggested that Assists (AST), Points (PTS), and Turnovers (TOV) were the three variables most correlated with a player’s salary. As a result, these were the three variables that I selected to plot players and their salaries in a 3D visualization.
As these three variables are the best individual predictors of a player’s salary, I graphed them expecting the players with the best stats across these variables to be the ones earning the highest salaries. However, I also expected to find players who perform highly across these three crucial stats but are compensated significantly less than players of similar caliber; these would be targets for acquisition that deserve further consideration. To show the discrepancies between players’ salaries, I plotted each player’s salary as the size of their point. A player with a small circle among players with much larger circles is therefore paid significantly less than other players of comparable caliber.
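Reading the salary row of a correlogram amounts to ranking features by the strength of their correlation with salary. The sketch below shows that ranking in plain Python using Pearson correlation; the choice of Pearson is an assumption (the write-up does not say which coefficient the correlogram uses), and the numbers are invented, not taken from the player data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical values for five players (salary in millions of dollars).
salary = [1, 2, 3, 4, 5]
stats = {
    "PTS": [5, 10, 15, 20, 25],
    "AST": [2, 1, 4, 3, 5],
    "BLK": [1, 3, 2, 3, 1],
}

correlations = {name: pearson(values, salary) for name, values in stats.items()}

# Rank by absolute correlation so strong negative relationships also surface.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

for name in ranked:
    print(f"{name}: r = {correlations[name]:.2f}")
```

The top three names in such a ranking are the natural candidates for the three axes of the plot.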
This visualization drew my attention to several players, specifically Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball.
After examining this visualization and equipped with cluster data from my initial k-means clustering, I then implemented a supervised machine learning regression approach to further examine the relationship between performance and compensation in order to predict who would be the most cost-efficient players to acquire. This model would consider all of their stats and the cluster that they were assigned to in the earlier k-means clustering model.
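The variable-importance output further down lists `cluster2` and `cluster3`, which suggests the cluster label enters the regression as a categorical feature with cluster 1 as the baseline. A minimal sketch of that encoding in plain Python follows; the helper name `dummy_encode` and the example player row are hypothetical, and the actual R model handles this step internally.

```python
def dummy_encode(value, levels, prefix):
    """Treatment-style coding: the first level is the all-zeros baseline."""
    return {f"{prefix}{level}": int(value == level) for level in levels[1:]}


CLUSTER_LEVELS = [1, 2, 3]

# Hypothetical player row: the stats are invented for illustration.
player = {"Age": 23, "PTS": 21.4, "AST": 6.1, "cluster": 3}

# Replace the raw cluster label with its indicator columns.
encoded = {k: v for k, v in player.items() if k != "cluster"}
encoded.update(dummy_encode(player["cluster"], CLUSTER_LEVELS, "cluster"))

print(encoded)
```

Encoding the label this way keeps the model from treating cluster numbers as an ordered quantity, since cluster 3 is not "more" than cluster 2 in any meaningful sense.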
Evaluating the performance metrics RMSE, Rsquared, and MAE while changing the hyperparameter maxdepth in order to identify the maxdepth level that maximizes performance while minimizing complexity
## CART
##
## 281 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 253, 253, 253, 253, 252, 253, ...
## Resampling results across tuning parameters:
##
##   maxdepth  RMSE     Rsquared   MAE
##    1        5107099  0.7235741  4109691
##    2        3140121  0.9075477  2416539
##    3        3203877  0.9056782  2444173
##    4        3203877  0.9056782  2444173
##    5        3203877  0.9056782  2444173
##    6        3203877  0.9056782  2444173
##    7        3203877  0.9056782  2444173
##    8        3203877  0.9056782  2444173
##    9        3203877  0.9056782  2444173
##   10        3203877  0.9056782  2444173
##   11        3203877  0.9056782  2444173
##   12        3203877  0.9056782  2444173
##   13        3203877  0.9056782  2444173
##   14        3203877  0.9056782  2444173
##   15        3203877  0.9056782  2444173
##   16        3203877  0.9056782  2444173
##   17        3203877  0.9056782  2444173
##   18        3203877  0.9056782  2444173
##   19        3203877  0.9056782  2444173
##   20        3203877  0.9056782  2444173
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 2.
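The resampling scheme reported above (10-fold cross-validation, repeated 5 times, on 281 samples) can be sketched in plain Python as follows. This is an illustration of the general technique, not the R routine that produced the output; in particular it shuffles rows at random, whereas the original tooling may balance folds on the outcome, so individual fold sizes can differ slightly. The function name `repeated_kfold` and the seed are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled rows into n_folds nearly equal folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for held_out in folds:
            held_out_set = set(held_out)
            train = [i for i in indices if i not in held_out_set]
            yield train, held_out


splits = list(repeated_kfold(281))
train_sizes = [len(train) for train, _ in splits]

print(len(splits), "model fits per tuning value")
print(sorted(set(train_sizes)), "distinct training-set sizes")
```

With 281 rows, nine folds hold out 28 rows and one holds out 29, which is why the summary of sample sizes shows training sets of 253 and 252.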
Building a generalized linear model to see if it provides better performance than the rpart2 model
## Generalized Linear Model
##
## 281 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 253, 253, 253, 253, 252, 253, ...
## Resampling results:
##
##   RMSE     Rsquared   MAE
##   2808440  0.9233609  2105813
The generalized linear model (glm) yields lower RMSE and MAE values and a higher Rsquared value than the rpart2 model, which indicates that the glm is the more accurate predictor of salary.
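For readers unfamiliar with the three metrics, the sketch below computes them in plain Python and then repeats the model comparison using the cross-validated figures reported above. Two assumptions are worth flagging: R-squared is computed here as the squared correlation between observed and predicted values, which I believe matches the default of the R tooling but have not verified against this run, and the four observed/predicted pairs are invented.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mae(observed, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    covariance = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance ** 2 / (ss_o * ss_p)


# Hypothetical salaries in millions of dollars.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]
print(rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))

# Cross-validated results copied from the two model summaries above.
cv_results = {
    "rpart2": {"RMSE": 3140121, "Rsquared": 0.9075477, "MAE": 2416539},
    "glm": {"RMSE": 2808440, "Rsquared": 0.9233609, "MAE": 2105813},
}
best_model = min(cv_results, key=lambda name: cv_results[name]["RMSE"])
print("Lowest cross-validated RMSE:", best_model)
```

Because RMSE squares each error before averaging, it is always at least as large as MAE, which is consistent with the figures reported for both models.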
Evaluating variable importance of the generalized linear model
## glm variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## cluster2 100.000
## cluster3 76.643
## Age 22.578
## G 17.290
## DRB 10.618
## AST 10.457
## BLK 9.853
## ORB 9.103
## FGA 7.339
## X3P 7.237
## FG 7.201
## X3PA 6.412
## PosPF 4.083
## TOV 3.951
## PosPG-SG 3.713
## PF 2.949
## FT. 2.614
## STL 2.415
## PosSG 1.882
## GS 1.667
Performance Metrics From Generalized Linear Model
##         RMSE     Rsquared          MAE
## 2.535178e+06 9.337872e-01 1.881484e+06
I developed a few machine learning regression models, and I ultimately chose to proceed with a generalized linear regression model. This model produced the metrics reported above: an RMSE of about 2.54 million, an Rsquared of about 0.934, and an MAE of about 1.88 million.
With these metrics showing that the model performs well, I then used the model to predict what a player’s salary should be based on their stats. By subtracting a player’s actual salary from their predicted salary, I created a column (pred_vs_obs_residual) that can be filtered to identify the players who are the most underpaid according to the model. Large positive values in this column flag players whom the model expects to earn more than they actually do.
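The residual column can be illustrated with the three players whose predicted and actual salaries are quoted later in this write-up. The sketch is plain Python rather than the original R code; the column name `pred_vs_obs_residual` is from the analysis, while the dictionary layout is an assumption made for the example.

```python
# Predicted and actual 2020-2021 salaries as quoted in this write-up.
players = [
    {"Player": "Trae Young", "predicted": 10_723_726, "actual": 6_571_800},
    {"Player": "Donovan Mitchell", "predicted": 8_092_526, "actual": 5_195_501},
    {"Player": "Shai Gilgeous-Alexander", "predicted": 7_311_061, "actual": 4_141_320},
]

# Positive residual: the model expects a higher salary than the player earns.
for p in players:
    p["pred_vs_obs_residual"] = p["predicted"] - p["actual"]

# Most underpaid first.
most_underpaid = sorted(
    players, key=lambda p: p["pred_vs_obs_residual"], reverse=True
)

for p in most_underpaid:
    print(f'{p["Player"]}: {p["pred_vs_obs_residual"]:,}')
```

Sorting on the raw dollar gap favours players with large salaries overall; dividing the residual by the actual salary would be an alternative ranking, though it is not the one used here.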
From the 3D visualization, I became interested in Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball, as they had high performance markers and appeared to be significantly underpaid relative to their peers. With interest in these players established, I then looked at the salary predictions from my supervised ML regression model to see which of them would be the most cost-effective to acquire.
Our generalized linear model predicted that Trae Young would earn $10,723,726, but during the 2020-2021 season he was paid only $6,571,800. The error metrics for our generalized linear model indicate that the typical error of our predictions is approximately $2.5 million by RMSE or $1.9 million by MAE. Even if the model over-predicted Trae Young’s salary by the full RMSE, the adjusted prediction of roughly $8.2 million would still exceed his actual salary. Among the players under consideration, I am the most confident that Trae Young is underpaid, so signing him is a great opportunity to gain a high-caliber player for less money.
Trae Young plotted in the 3D visualization of Assists, Points, Turnovers, and Salary
Looking at the 3D visualization, Donovan Mitchell is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Donovan Mitchell would earn $8,092,526, but during the 2020-2021 season he was paid only $5,195,501.
Donovan Mitchell plotted in the 3D visualization of Assists, Points, Turnovers, and Salary
Looking at the 3D visualization, Shai Gilgeous-Alexander is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Shai Gilgeous-Alexander would earn $7,311,061, but during the 2020-2021 season he was paid only $4,141,320.
Shai Gilgeous-Alexander plotted in the 3D visualization of Assists, Points, Turnovers, and Salary
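The error-margin reasoning applied to Trae Young can be extended to all three players discussed here. The sketch below, in plain Python, subtracts the model's RMSE from each prediction and checks whether the result still exceeds the actual salary. Treating the RMSE as a one-sided safety margin is a simplification rather than a formal confidence interval, and the variable names are illustrative.

```python
# RMSE of the generalized linear model, from the metrics reported earlier.
MODEL_RMSE = 2_535_178

# Predicted and actual 2020-2021 salaries as quoted in this write-up.
salaries = {
    "Trae Young": (10_723_726, 6_571_800),
    "Donovan Mitchell": (8_092_526, 5_195_501),
    "Shai Gilgeous-Alexander": (7_311_061, 4_141_320),
}

margins = {}
for name, (predicted, actual) in salaries.items():
    # Assume the model over-predicted by one full RMSE.
    conservative_prediction = predicted - MODEL_RMSE
    # A positive margin means the player still looks underpaid.
    margins[name] = conservative_prediction - actual

for name, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    verdict = "still underpaid" if margin > 0 else "within model error"
    print(f"{name}: margin {margin:,} ({verdict})")
```

All three players clear this bar, but Trae Young does so by the widest margin, which matches the confidence expressed about him above.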